archived/workshops/llama2-7b-batching-throughput.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "a729a7fc-a1b6-4832-9155-d240ccd8ecc0", "metadata": { "tags": [] }, "source": [ "# Increase throughput for Llama2-7b Model using Batching techniques on SageMaker LMI v5" ] }, { "cell_type": "markdown", "id": "9ee7df30-e340-466c-a8e3-665334ad86d6", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.\n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "e2958bcf-767f-4cd7-826e-963f91565876", "metadata": {}, "source": [ "In this notebook, we explore how to use different batching techniques to increase throughput for the Llama2-7b large language model on SageMaker using the LMI v5 container. In this example we use DJLServing, which is bundled in the LMI container, as the model serving solution. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, refer to the documentation (https://docs.djl.ai/docs/serving/index.html).\n", "\n", "Batching helps to increase throughput for generative AI inference by combining requests and sending them to the LLM together as a batch. In this notebook we explore three batching techniques (Dynamic Batching, Continuous Batching, and Paged Attention Batching) and demonstrate the throughput gains achieved.\n", "\n", "We use the SageMaker LMI v5 container, which provides rolling batch capability for Continuous Batching along with Paged Attention. 
In this notebook, we deploy the https://huggingface.co/TheBloke/Llama-2-7B-fp16 model across GPUs on an ml.g5.12xlarge instance." ] }, { "cell_type": "markdown", "id": "37ffcfa8-02c5-4492-af5c-fdcf60f369db", "metadata": {}, "source": [ "### Import required libraries and establish session using SageMaker SDK" ] }, { "cell_type": "code", "execution_count": null, "id": "8dd6a36f-ee95-453d-a127-c8a7de6a026d", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install sagemaker boto3 huggingface_hub --upgrade --quiet" ] }, { "cell_type": "code", "execution_count": null, "id": "cf0c89f4-679c-4557-b95d-1d954c15a020", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import jinja2\n", "from sagemaker import image_uris\n", "import boto3\n", "import os\n", "import time\n", "import json\n", "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": null, "id": "561b79ba-7354-4d40-87ce-dc5813092576", "metadata": { "tags": [] }, "outputs": [], "source": [ "role = sagemaker.get_execution_role() # execution role for the endpoint\n", "sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs\n", "bucket = sess.default_bucket() # bucket to house artifacts" ] }, { "cell_type": "code", "execution_count": null, "id": "f68b5181-d018-4564-9762-fa8770a9672f", "metadata": { "tags": [] }, "outputs": [], "source": [ "model_bucket = sess.default_bucket() # bucket to house model artifacts\n", "s3_code_prefix = \"hf-large-model-djl/meta-llama/Llama-2-7b-fp16/code\" # folder within bucket where code artifact will go\n", "\n", "s3_model_prefix = \"hf-large-model-djl/meta-llama/Llama-2-7b-fp16/model\" # folder within bucket where model artifact will go\n", "region = sess.boto_region_name # public attribute holding the session's AWS region\n", "account_id = sess.account_id()\n", "\n", "s3_client = boto3.client(\"s3\")\n", "sm_client = boto3.client(\"sagemaker\")\n", "smr_client = boto3.client(\"sagemaker-runtime\")\n", "\n", "jinja_env = jinja2.Environment()" ] }, { 
"cell_type": "markdown", "id": "959e1413-76f8-4b01-88b5-1962c012d438", "metadata": {}, "source": [ "### [OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3\n", "\n", "If you intend to download your copy of the model and upload it to a s3 location in your AWS account, please follow the below steps, else you can skip to the next step." ] }, { "cell_type": "code", "execution_count": null, "id": "94af859c-4c3a-4fda-ae27-890be565a906", "metadata": { "tags": [] }, "outputs": [], "source": [ "\"\"\"from huggingface_hub import snapshot_download\n", "from pathlib import Path\n", "import os\n", "\n", "# - This will download the model into the current directory where ever the jupyter notebook is running\n", "local_model_path = Path(\".\")\n", "local_model_path.mkdir(exist_ok=True)\n", "model_name = \"TheBloke/Llama-2-7b-fp16\"\n", "# Only download pytorch checkpoint files\n", "allow_patterns = [\"*.json\", \"*.txt\", \"*.model\", \"*.safetensors\", \"*.bin\", \"*.chk\", \"*.pth\"]\n", "\n", "# - Leverage the snapshot library to donload the model since the model is stored in repository using LFS\n", "model_download_path = snapshot_download(\n", " repo_id=model_name, cache_dir=local_model_path, allow_patterns=allow_patterns\n", ")\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "6f355f9f-69e2-4c1e-a467-5c5520a9b142", "metadata": { "tags": [] }, "outputs": [], "source": [ "# upload files from local to S3 location\n", "# pretrained_model_location = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)\n", "# print(f\"Model uploaded to --- > {pretrained_model_location}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fa63b0eb-1d2b-4cfd-98c8-c7611878518d", "metadata": {}, "outputs": [], "source": [ "# Cleanup locally stored model files post S3 upload\n", "#!rm -rf {model_download_path}" ] }, { "cell_type": "markdown", "id": "7d9697ac-e79b-4159-8c44-ffb6bcd8b7da", "metadata": {}, "source": [ 
"### Define a variable to contain the s3 url of the location that has the model" ] }, { "cell_type": "code", "execution_count": null, "id": "76bc629a-c817-4bd8-be42-a4f30d0b075d", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Define a variable to contain the s3 url of the location that has the model. For demo purpose, we use Llama-2-7b-fp16 model artifacts from our S3 bucket\n", "pretrained_model_location = f\"s3://sagemaker-example-files-prod-{region}/models/llama-2/fp16/7B/\"" ] }, { "cell_type": "markdown", "id": "9913de41-193a-4bfb-b42e-581c7677d0ed", "metadata": {}, "source": [ "## Deploy 3 endpoints for benchmarking with settings to enable different Batching techniques\n", "\n", "We will deploy 3 different endpoints for benchmarking with different Batching techniques as below:\n", "\n", "- Dynamic Batching\n", "- Continuous Batching\n", "- Paged Attention Batching" ] }, { "cell_type": "markdown", "id": "b50f5d44-6388-49fa-bc87-2e82bed226f3", "metadata": {}, "source": [ "### 1. Dynamic Batching\n", "#### 1.1 Create serving.properties for Dynamic Batching\n", "This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.\n", "\n", "Here is a list of settings that we use in this configuration file -\n", "\n", " engine: The engine for DJL to use. In this case, we have set it to Python (Dynamic Batching) or MPI (Continuous and Paged Attention Batching).\n", " option.model_id: The model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models) or S3 path to the model artifacts. \n", " option.tensor_parallel_degree: Set to the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the no of workers per model which will be started up when DJL serving runs. 
For example, if we have an 8-GPU machine and create 8 partitions, we will have 1 worker per model to serve the requests.\n", "\n", "For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html." ] }, { "cell_type": "code", "execution_count": null, "id": "c2691f5b-4e98-4f7a-887c-49c05bbf7a8e", "metadata": { "tags": [] }, "outputs": [], "source": [ "!rm -rf code_llama2_7b_fp16\n", "!mkdir -p code_llama2_7b_fp16" ] }, { "cell_type": "code", "execution_count": null, "id": "42b040a8-2180-46cd-aa5c-b1f6d50c2dcf", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile code_llama2_7b_fp16/serving.properties\n", "engine = Python\n", "option.entryPoint = djl_python.huggingface\n", "option.tensor_parallel_degree = 2\n", "batch_size = 64\n", "max_batch_delay = 1000\n", "option.model_loading_timeout = 900\n", "option.model_id = {{model_id}}" ] }, { "cell_type": "code", "execution_count": null, "id": "996ebeb2-cfee-4e7b-af8f-3ccc811fa1eb", "metadata": { "tags": [] }, "outputs": [], "source": [ "# we plug in the appropriate model location into our `serving.properties`\n", "template = jinja_env.from_string(Path(\"code_llama2_7b_fp16/serving.properties\").open().read())\n", "Path(\"code_llama2_7b_fp16/serving.properties\").open(\"w\").write(\n", " template.render(model_id=pretrained_model_location)\n", ")\n", "!pygmentize code_llama2_7b_fp16/serving.properties | cat -n" ] }, { "cell_type": "markdown", "id": "ec9db9c4-5023-4125-a413-7c5afa135218", "metadata": {}, "source": [ "**Image URI for the DJL container is being used here**" ] }, { "cell_type": "code", "execution_count": null, "id": "8dfabe0a-04f8-486d-94ab-7d6066680954", "metadata": { "tags": [] }, "outputs": [], "source": [ "inference_image_uri = image_uris.retrieve(\n", " framework=\"djl-deepspeed\", region=region, version=\"0.23.0\"\n", ")\n", "print(f\"Image 
going to be used is ---- > {inference_image_uri}\")" ] }, { "cell_type": "markdown", "id": "905903de-a4b9-4f41-8cc2-564416ae5d5f", "metadata": {}, "source": [ "**Create the Tarball and then upload to S3 location**" ] }, { "cell_type": "code", "execution_count": null, "id": "c005aa2e-ec1a-4ccf-8f39-67bcbabd0309", "metadata": { "tags": [] }, "outputs": [], "source": [ "!rm model.tar.gz\n", "!tar czvf model.tar.gz code_llama2_7b_fp16" ] }, { "cell_type": "code", "execution_count": null, "id": "bb83ba3b-2ea5-4297-8e85-f16dd4c7c13a", "metadata": { "tags": [] }, "outputs": [], "source": [ "s3_code_artifact = sess.upload_data(\"model.tar.gz\", bucket, s3_code_prefix)" ] }, { "cell_type": "markdown", "id": "efa56b75-356a-4ebf-bebd-7150521c95e9", "metadata": {}, "source": [ "#### 1.2 Deploy endpoint for Dynamic Batching" ] }, { "cell_type": "code", "execution_count": null, "id": "eac40dc4-c9c3-4af1-8d54-7b6742259266", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"Llama-2-7b-fp16-mpi\")\n", "print(model_name)\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=model_name,\n", " ExecutionRoleArn=role,\n", " PrimaryContainer={\"Image\": inference_image_uri, \"ModelDataUrl\": s3_code_artifact},\n", ")\n", "model_arn = create_model_response[\"ModelArn\"]\n", "\n", "print(f\"Created Model: {model_arn}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "332e4a25-233f-4e75-aabb-0986c6c48f77", "metadata": { "tags": [] }, "outputs": [], "source": [ "endpoint_config_name = f\"{model_name}-config\"\n", "endpoint_name = f\"{model_name}-endpoint\"\n", "\n", "endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"VariantName\": \"variant1\",\n", " \"ModelName\": model_name,\n", " \"InstanceType\": \"ml.g5.12xlarge\",\n", " \"InitialInstanceCount\": 1,\n", " 
\"ModelDataDownloadTimeoutInSeconds\": 900,\n", " \"ContainerStartupHealthCheckTimeoutInSeconds\": 900,\n", " },\n", " ],\n", ")\n", "endpoint_config_response" ] }, { "cell_type": "code", "execution_count": null, "id": "4b6a0ca5-3ee6-4793-83f4-d9565a7d06e5", "metadata": { "tags": [] }, "outputs": [], "source": [ "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=f\"{endpoint_name}\", EndpointConfigName=endpoint_config_name\n", ")\n", "print(f\"Created Endpoint: {create_endpoint_response['EndpointArn']}\")" ] }, { "cell_type": "markdown", "id": "32c013d1-1460-4592-b9be-161951ed9ba0", "metadata": { "tags": [] }, "source": [ "#### Wait for endpoint to be In-service. This can take a while, so please be patient" ] }, { "cell_type": "code", "execution_count": null, "id": "2ace9f1d-c681-425c-acfb-102cfc079fff", "metadata": { "tags": [] }, "outputs": [], "source": [ "import time\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "8cc0f36d-a23a-4ed5-91f2-6770723c20e3", "metadata": { "tags": [] }, "source": [ "### 2. 
Continuous Batching\n", "#### 2.1 Create serving.properties for Continuous Batching" ] }, { "cell_type": "code", "execution_count": null, "id": "939d6ee0-2093-43b5-87bc-c5b14fdc2bdd", "metadata": { "tags": [] }, "outputs": [], "source": [ "!rm -rf code_llama2_7b_fp16\n", "!mkdir -p code_llama2_7b_fp16" ] }, { "cell_type": "code", "execution_count": null, "id": "b56e1aab-ad03-4928-96cc-3f84b75b0468", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile code_llama2_7b_fp16/serving.properties\n", "engine = MPI\n", "option.entryPoint = djl_python.huggingface\n", "option.rolling_batch = auto\n", "option.max_rolling_batch_size = 64\n", "option.paged_attention = false\n", "option.max_rolling_batch_prefill_tokens = 16080\n", "option.tensor_parallel_degree = 2\n", "option.model_loading_timeout = 900\n", "option.model_id = {{model_id}}" ] }, { "cell_type": "code", "execution_count": null, "id": "fe0ec476-cf2f-428e-9dc5-9e3fef0dc9f6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# we plug in the appropriate model location into our `serving.properties`\n", "template = jinja_env.from_string(Path(\"code_llama2_7b_fp16/serving.properties\").open().read())\n", "Path(\"code_llama2_7b_fp16/serving.properties\").open(\"w\").write(\n", " template.render(model_id=pretrained_model_location)\n", ")\n", "!pygmentize code_llama2_7b_fp16/serving.properties | cat -n" ] }, { "cell_type": "code", "execution_count": null, "id": "80656021-4ef6-403b-a74f-48fddccf0489", "metadata": { "tags": [] }, "outputs": [], "source": [ "inference_image_uri = image_uris.retrieve(\n", " framework=\"djl-deepspeed\", region=region, version=\"0.23.0\"\n", ")\n", "print(f\"Image going to be used is ---- > {inference_image_uri}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "429265b3-f867-4c29-bfb2-261a2f1228cc", "metadata": { "tags": [] }, "outputs": [], "source": [ "!rm model.tar.gz\n", "!tar czvf model.tar.gz code_llama2_7b_fp16" ] }, { "cell_type": "code", 
"execution_count": null, "id": "a5036049-1b5c-43b1-a2d3-71bcebe47843", "metadata": { "tags": [] }, "outputs": [], "source": [ "s3_code_artifact = sess.upload_data(\"model.tar.gz\", bucket, s3_code_prefix)" ] }, { "cell_type": "markdown", "id": "343ca23b-fd40-4213-810e-4b8a08300693", "metadata": {}, "source": [ "#### 2.2 Deploy endpoint for Continuous Batching" ] }, { "cell_type": "code", "execution_count": null, "id": "c3555478-c9a4-4d84-9d73-b9e053b24da9", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"Llama-2-7b-fp16-mpi\")\n", "print(model_name)\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=model_name,\n", " ExecutionRoleArn=role,\n", " PrimaryContainer={\"Image\": inference_image_uri, \"ModelDataUrl\": s3_code_artifact},\n", ")\n", "model_arn = create_model_response[\"ModelArn\"]\n", "\n", "print(f\"Created Model: {model_arn}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "16a65c03-7dc6-47f6-b8fa-c0d3ff7c1626", "metadata": { "tags": [] }, "outputs": [], "source": [ "endpoint_config_name = f\"{model_name}-config\"\n", "endpoint_name = f\"{model_name}-endpoint\"\n", "\n", "endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"VariantName\": \"variant1\",\n", " \"ModelName\": model_name,\n", " \"InstanceType\": \"ml.g5.12xlarge\",\n", " \"InitialInstanceCount\": 1,\n", " \"ModelDataDownloadTimeoutInSeconds\": 900,\n", " \"ContainerStartupHealthCheckTimeoutInSeconds\": 900,\n", " },\n", " ],\n", ")\n", "endpoint_config_response" ] }, { "cell_type": "code", "execution_count": null, "id": "226e5e11-7889-457e-9215-aaa19602ef0f", "metadata": { "tags": [] }, "outputs": [], "source": [ "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=f\"{endpoint_name}\", EndpointConfigName=endpoint_config_name\n", ")\n", 
"print(f\"Created Endpoint: {create_endpoint_response['EndpointArn']}\")" ] }, { "cell_type": "markdown", "id": "ab6d03f1-fa32-4015-881b-b5bd7ef55243", "metadata": { "tags": [] }, "source": [ "#### This can take a while, so please be patient" ] }, { "cell_type": "code", "execution_count": null, "id": "2cc230b1-cbc5-45ac-aaf8-4da71a1eba64", "metadata": { "tags": [] }, "outputs": [], "source": [ "import time\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "15284594-50f1-4d41-a5d6-59cba32811a7", "metadata": { "tags": [] }, "source": [ "### 3. Paged Attention Batching\n", "#### 3.1 Create serving.properties for Paged Attention Batching" ] }, { "cell_type": "code", "execution_count": null, "id": "625d73bd-9349-4974-bb89-c5c5169e0f6f", "metadata": { "tags": [] }, "outputs": [], "source": [ "!rm -rf code_llama2_7b_fp16\n", "!mkdir -p code_llama2_7b_fp16" ] }, { "cell_type": "code", "execution_count": null, "id": "58417e12-e16d-42e4-82a9-7c4ab659ece5", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile code_llama2_7b_fp16/serving.properties\n", "engine = MPI\n", "option.entryPoint = djl_python.huggingface\n", "option.rolling_batch = auto\n", "option.max_rolling_batch_size = 64\n", "option.paged_attention = true\n", "option.max_rolling_batch_prefill_tokens = 16080\n", "option.tensor_parallel_degree = 2\n", "option.model_loading_timeout = 900\n", "option.model_id = {{model_id}}" ] }, { "cell_type": "code", "execution_count": null, "id": "969a9814-6ec8-4ac1-98d7-95ec687b98ff", "metadata": { "tags": [] }, "outputs": [], "source": [ "# we 
plug in the appropriate model location into our `serving.properties`\n", "template = jinja_env.from_string(Path(\"code_llama2_7b_fp16/serving.properties\").open().read())\n", "Path(\"code_llama2_7b_fp16/serving.properties\").open(\"w\").write(\n", " template.render(model_id=pretrained_model_location)\n", ")\n", "!pygmentize code_llama2_7b_fp16/serving.properties | cat -n" ] }, { "cell_type": "code", "execution_count": null, "id": "8c5a8c3f-3977-4607-83a0-527bb6b0b8fd", "metadata": { "tags": [] }, "outputs": [], "source": [ "inference_image_uri = image_uris.retrieve(\n", " framework=\"djl-deepspeed\", region=region, version=\"0.23.0\"\n", ")\n", "print(f\"Image going to be used is ---- > {inference_image_uri}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e731bf06-f238-4263-bcb4-a35296d8c94d", "metadata": { "tags": [] }, "outputs": [], "source": [ "!rm model.tar.gz\n", "!tar czvf model.tar.gz code_llama2_7b_fp16" ] }, { "cell_type": "code", "execution_count": null, "id": "83e4442b-150c-4622-8e8b-549d269af660", "metadata": { "tags": [] }, "outputs": [], "source": [ "s3_code_artifact = sess.upload_data(\"model.tar.gz\", bucket, s3_code_prefix)" ] }, { "cell_type": "markdown", "id": "77d09b0f-41cc-4a7d-bd05-ab94d5d9ade3", "metadata": {}, "source": [ "#### 3.2 Deploy endpoint for Paged Attention Batching" ] }, { "cell_type": "code", "execution_count": null, "id": "c5f6a0ce-142b-4948-94f8-f64f7287f467", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"Llama-2-7b-fp16-mpi\")\n", "print(model_name)\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=model_name,\n", " ExecutionRoleArn=role,\n", " PrimaryContainer={\"Image\": inference_image_uri, \"ModelDataUrl\": s3_code_artifact},\n", ")\n", "model_arn = create_model_response[\"ModelArn\"]\n", "\n", "print(f\"Created Model: {model_arn}\")" ] }, { "cell_type": "code", "execution_count": null, 
"id": "6263256d-ffaf-4ddf-8121-e8cccc56cc36", "metadata": { "tags": [] }, "outputs": [], "source": [ "endpoint_config_name = f\"{model_name}-config\"\n", "endpoint_name = f\"{model_name}-endpoint\"\n", "\n", "endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"VariantName\": \"variant1\",\n", " \"ModelName\": model_name,\n", " \"InstanceType\": \"ml.g5.12xlarge\",\n", " \"InitialInstanceCount\": 1,\n", " \"ModelDataDownloadTimeoutInSeconds\": 900,\n", " \"ContainerStartupHealthCheckTimeoutInSeconds\": 900,\n", " },\n", " ],\n", ")\n", "endpoint_config_response" ] }, { "cell_type": "code", "execution_count": null, "id": "d1f4abca-8f6c-470a-a36d-2382c8f12f46", "metadata": { "tags": [] }, "outputs": [], "source": [ "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=f\"{endpoint_name}\", EndpointConfigName=endpoint_config_name\n", ")\n", "print(f\"Created Endpoint: {create_endpoint_response['EndpointArn']}\")" ] }, { "cell_type": "markdown", "id": "ee04ff53-74dd-4231-9763-9071ee1f1a08", "metadata": { "tags": [] }, "source": [ "#### This can take a while, so please be patient" ] }, { "cell_type": "code", "execution_count": null, "id": "d36ea00c-5d7a-41b5-b592-cac5cd19741d", "metadata": { "tags": [] }, "outputs": [], "source": [ "import time\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "0902897d-8c8e-4342-901e-e5da20ef7196", "metadata": { "tags": [] }, "source": [ "### Benchmark your model \n", "\n", "This is a generative 
model, so we pass in text as a prompt and the model completes the sentence and returns the results. We will use the awscurl command line tool to try it (the awscurl command line tool requires Java).\n", "\n", "We pass multiple prompts as input to the model. This is done by downloading the benchmarking tool below and setting up the desired performance test environment.\n", "\n", "**The following steps need to be run in a Studio terminal, on EC2, or on a Notebook instance**. In the example below, we use a concurrency of 50 clients sending 100 requests from a g5.12xlarge Studio notebook terminal. If you want to benchmark at lower or higher concurrency, feel free to use a smaller or larger compute instance to run the concurrent benchmark tests." ] }, { "cell_type": "markdown", "id": "9ab53547-c094-4486-be48-2f20e112f04c", "metadata": { "tags": [] }, "source": [ "#### Install Java using the shell" ] }, { "cell_type": "code", "execution_count": null, "id": "41e4ed9e-22d0-484d-b5e4-6b8a1aad5111", "metadata": { "tags": [] }, "outputs": [], "source": [ "!sudo yum -y update\n", "!sudo yum -y install java wget" ] }, { "cell_type": "markdown", "id": "4428685a-0198-400a-9b89-6e59523cfc60", "metadata": {}, "source": [ "#### Download the benchmarking tool" ] }, { "cell_type": "code", "execution_count": null, "id": "87af33a7-ca98-4579-b210-8c9926b46b8e", "metadata": {}, "outputs": [], "source": [ "!wget https://github.com/frankfliu/junkyard/releases/download/v0.3.1/awscurl\n", "!chmod +x awscurl" ] }, { "cell_type": "markdown", "id": "e7810eb8-14c5-4ebf-b1ba-72bffcced698", "metadata": {}, "source": [ "#### Create prompts to be used with Dynamic Batching" ] }, { "cell_type": "code", "execution_count": null, "id": "51055b04-67e7-435a-b075-beb372694981", "metadata": {}, "outputs": [], "source": [ "!mkdir dyn_prompts\n", "!echo '{\"inputs\":\"The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of 
the\",\"parameters\":{\"min_new_tokens\":128, \"max_new_tokens\":128, \"do_sample\":true}}' > dyn_prompts/dyn_prompt1.txt\n", "!echo '{\"inputs\":\"Write a program to add two numbers in python\",\"parameters\":{\"min_new_tokens\":128, \"max_new_tokens\":128, \"do_sample\":true}}' > dyn_prompts/dyn_prompt2.txt\n", "!echo '{\"inputs\":\"Generate a list of ten titles for my book. The book is about my travels around the world, experiencing different cultures and cuisines, meeting many different personalities and finally settling in a remote village in the himalayas\",\"parameters\":{\"min_new_tokens\":128, \"max_new_tokens\":128, \"do_sample\":true}}' > dyn_prompts/dyn_prompt3.txt" ] }, { "cell_type": "markdown", "id": "9f265903-cb52-46f6-a21f-2dc49ffcc7a5", "metadata": {}, "source": [ "#### Set up credentials using env vars or use .aws/credentials file" ] }, { "cell_type": "code", "execution_count": null, "id": "e0a17a34-c7c4-4b81-b0d1-8f3a6340f3e0", "metadata": {}, "outputs": [], "source": [ "# `!export` only affects a short-lived subshell, so set the variables with %env in a notebook cell\n", "# (in a terminal, run `export NAME=value` instead)\n", "%env AWS_ACCESS_KEY_ID=<Update here>\n", "%env AWS_SECRET_ACCESS_KEY=<Update here>\n", "%env AWS_SESSION_TOKEN=<Update here>" ] }, { "cell_type": "markdown", "id": "c36e82ed-f231-4d52-bc9e-8438edd660e6", "metadata": {}, "source": [ "## Dynamic Batching Benchmarking\n", "\n", "We run the benchmarking tool to get the results for Dynamic Batching. You can change the concurrency through -c and the number of requests through -N. However, please ensure that the instance running the test has enough compute and that the endpoint is deployed on an instance capable of handling the concurrency. Run awscurl -h to get more help on the benchmark tool." 
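, "\n", "\n", "The awscurl command in the next cell expects the endpoint's invocation URL in place of `INSERTDYNAMICENDPOINTURLHERE`. The following is a minimal sketch of how that URL could be built, assuming the standard SageMaker runtime URL pattern for invoking an endpoint. The helper name `invocation_url` is hypothetical and the example values are made up; in this notebook you would pass the `region` and `endpoint_name` variables from the deployment cells.\n", "\n", "```python\n", "def invocation_url(region: str, endpoint_name: str) -> str:\n", "    # Assumed URL pattern of the SageMaker runtime InvokeEndpoint API\n", "    return f\"https://runtime.sagemaker.{region}.amazonaws.com/endpoints/{endpoint_name}/invocations\"\n", "\n", "\n", "# Example with made-up values; substitute your own region and endpoint name\n", "print(invocation_url(\"us-west-2\", \"my-endpoint\"))\n", "```\n", "\n", "Because `endpoint_name` is overwritten each time another endpoint is deployed, it is worth recording each endpoint's URL right after its deployment cell runs." 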
] }, { "cell_type": "code", "execution_count": null, "id": "fe3bf9a1-2370-431f-a009-ee0b583472ff", "metadata": {}, "outputs": [], "source": [ "!EXCLUDE_INPUT_TOKEN=1 TOKENIZER=TheBloke/Llama-2-7b-fp16 ./awscurl -c 50 -N 100 -X POST INSERTDYNAMICENDPOINTURLHERE --connect-timeout 60 -H \"Content-Type: application/json\" --dataset dyn_prompts -t -n sagemaker" ] }, { "cell_type": "markdown", "id": "ae49bf71-90c5-46fb-b893-400961bbb80d", "metadata": { "tags": [] }, "source": [ "#### After the benchmark tests complete, the results appear in the format below. The values shown are samples; actual results will vary with the environment setup.\n", "\n", "- Total time: 25073.52 ms.\n", "- Non 200 responses: 0, error rate: 0.00\n", "- Concurrent clients: 2\n", "- Total requests: 4\n", "- TPS: 0.16/s\n", "- Total token: 512\n", "- token/req: 128\n", "- token/s: 20.42/s\n", "- Average Latency: 12359.70 ms.\n", "- P50: 14058.52 ms.\n", "- P90: 14059.23 ms.\n", "- P99: 14059.23 ms.\n", "\n", "Please make a note of the actual values for later comparison." ] }, { "cell_type": "markdown", "id": "583b130c-e219-4af9-a3dd-d12b64f32a35", "metadata": {}, "source": [ "## Continuous Batching Benchmarking\n", "\n", "We run the benchmarking tool to get the results for Continuous Batching." 
] }, { "cell_type": "code", "execution_count": null, "id": "e11641cb-9e85-4cc6-bf76-6006204456f5", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set up prompts to be used for Continuous and Paged Attention Batching\n", "!mkdir prompts\n", "!echo '{\"inputs\":\"The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the\",\"parameters\":{\"min_new_tokens\":64, \"max_new_tokens\":64, \"do_sample\":true}}' > prompts/prompt1.txt\n", "!echo '{\"inputs\":\"Write a program to add two numbers in python\",\"parameters\":{\"min_new_tokens\":256, \"max_new_tokens\":256, \"do_sample\":true}}' > prompts/prompt2.txt\n", "!echo '{\"inputs\":\"Generate a list of ten titles for my book. The book is about my travels around the world, experiencing different cultures and cuisines, meeting many different personalities and finally settling in a remote village in the hills\",\"parameters\":{\"min_new_tokens\":128, \"max_new_tokens\":128, \"do_sample\":true}}' > prompts/prompt3.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "3b239f5b-15c3-4fc3-a354-f9bfcfdacacf", "metadata": { "tags": [] }, "outputs": [], "source": [ "!TOKENIZER=TheBloke/Llama-2-7b-fp16 ./awscurl -c 2 -N 2 -X POST INSERTCONTINUOUSENDPOINTURLHERE --connect-timeout 60 -H \"Content-Type: application/json\" --dataset prompts -t -n sagemaker" ] }, { "cell_type": "markdown", "id": "0b1b490a-0b1b-43a0-876d-843d1789dff3", "metadata": { "tags": [] }, "source": [ "Please make a note of actual values from the benchmark tests for later comparison" ] }, { "cell_type": "markdown", "id": "d589cf1c-a88e-4bd2-9e61-7e1904107d4a", "metadata": {}, "source": [ "## Benchmarking throughput for Paged Attention Batching\n", "\n", "We run the benchmarking tool to get the results for Paged Attention Batching." 
] }, { "cell_type": "code", "execution_count": null, "id": "9f7c25e3-b6d4-4a4e-8488-cb0880f83954", "metadata": { "tags": [] }, "outputs": [], "source": [ "!TOKENIZER=TheBloke/Llama-2-7b-fp16 ./awscurl -c 2 -N 2 -X POST INSERTPAGEDENDPOINTURLHERE --connect-timeout 60 -H \"Content-Type: application/json\" --dataset prompts -t -n sagemaker" ] }, { "cell_type": "markdown", "id": "6c9edc1c-8a8f-4187-b345-016f8237c4e2", "metadata": {}, "source": [ "Please make a note of the actual values from the benchmark tests for later comparison." ] }, { "cell_type": "markdown", "id": "a665d5f7-21e8-4b08-a8f3-e3ba353f4fc5", "metadata": { "tags": [] }, "source": [ "## We can now compare the throughput results (e.g., Token/s and TPS) for the different batching techniques to review the throughput gains achieved by using Continuous and Paged Attention Batching over Dynamic Batching." ] }, { "cell_type": "markdown", "id": "3b08acc7-4535-421d-bfe5-1a3f1cb2a2d5", "metadata": {}, "source": [ "\n", "| Model | Batching strategy | TPS | Token/s |\n", "|-------|-------------------------|----------|--------------|\n", "|llama2-7b | Dynamic Batching | 2.62 | 336.28 |\n", "|llama2-7b | Continuous Batching | 6.14 | 849.48 |\n", "|llama2-7b | Paged Attention Batching | 6.29 | 889.68 |" ] }, { "cell_type": "markdown", "id": "38886de1-a30b-41e3-b99d-b4454e37760b", "metadata": {}, "source": [ "## Clean Up\n", "Delete the resources (Endpoint, Endpoint config, Model) deployed for the 3 endpoints used in the tests above." ] }, { "cell_type": "markdown", "id": "32785d76-a4f1-40a0-8632-0de42cf283e0", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|llama2-7b-batching-throughput|llama2-7b-batching-throughput.ipynb)\n" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 
}, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, 
"_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", 
"vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, 
"memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 
64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }
